[Feat] Adds LongCat-AudioDiT pipeline #13390
RuixiangMa wants to merge 11 commits into huggingface:main
Conversation
Signed-off-by: Lancer <maruixiang6688@gmail.com>
Force-pushed 9c4613f to d2a2621
| def _pixel_shuffle_1d(hidden_states: torch.Tensor, factor: int) -> torch.Tensor: |

Similarly, I think we should inline _pixel_shuffle_1d in UpsampleShortcut following #13390 (comment).
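For context on what the helper being inlined does: a 1D pixel shuffle rearranges a (channels * factor, time) tensor into (channels, time * factor) by interleaving the factor sub-channels along the time axis. A minimal pure-Python sketch on nested lists (the actual implementation operates on torch tensors with reshape/permute; this only illustrates the index mapping):

```python
def pixel_shuffle_1d(x, factor):
    """1D pixel shuffle on a nested list of shape (channels * factor, time).

    Returns a list of shape (channels, time * factor); output position
    t * factor + r in channel c takes sub-channel r of input channel c at
    time step t, the 1D analog of torch.nn.PixelShuffle.
    """
    channels = len(x) // factor
    time = len(x[0])
    out = [[0] * (time * factor) for _ in range(channels)]
    for c in range(channels):
        for r in range(factor):
            for t in range(time):
                # interleave sub-channel r into the upsampled time axis
                out[c][t * factor + r] = x[c * factor + r][t]
    return out
```

For example, `pixel_shuffle_1d([[1, 2], [3, 4]], 2)` returns `[[1, 3, 2, 4]]`: two channels collapse into one channel at twice the time resolution.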
| self.time_embed = AudioDiTTimestepEmbedding(dim) |
| self.input_embed = AudioDiTEmbedder(latent_dim, dim) |
| self.text_embed = AudioDiTEmbedder(dit_text_dim, dim) |
| self.rotary_embed = AudioDiTRotaryEmbedding(dim_head, 2048, base=100000.0) |
| self.blocks = nn.ModuleList( |

See #13390 (comment).
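The AudioDiTRotaryEmbedding line above configures RoPE with head dimension dim_head, up to 2048 positions, and base 100000.0. Assuming the standard RoPE formulation (the actual class may differ in detail), the inverse-frequency table and per-position angles can be sketched as:

```python
def rotary_inv_freq(dim_head, base=100000.0):
    # One inverse frequency per channel pair, as in standard RoPE:
    # inv_freq[i] = base ** (-2i / dim_head)
    return [base ** (-(2.0 * i) / dim_head) for i in range(dim_head // 2)]

def rotary_angles(position, dim_head, base=100000.0):
    # Rotation angle for each channel pair at a given integer position;
    # the model would apply cos/sin of these angles to query/key pairs.
    return [position * f for f in rotary_inv_freq(dim_head, base)]
```

A larger base (100000.0 here vs. the 10000.0 common in text models) slows the frequency decay, which helps when the position range is long, as with the 2048 audio-latent positions above.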
| batch_size = hidden_states.shape[0] |
| if timestep.ndim == 0: |
| timestep = timestep.repeat(batch_size) |
| timestep_embed = self.time_embed(timestep) |
| text_mask = encoder_attention_mask.bool() |
| encoder_hidden_states = self.text_embed(encoder_hidden_states, text_mask) |

Can you also refactor forward here so that it is better organized, following #13390 (comment)? See for example the QwenImageTransformer2DModel.forward method.

Reorganized parts of forward incrementally; kept the current structure otherwise to avoid unnecessary behavioral churn. Thanks for the comments, I have made the changes, PTAL.
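The organization being requested is essentially a staged forward: embed inputs and conditioning first, then run the blocks, then project out. A framework-free sketch of that shape, with plain callables standing in for the actual nn.Module attributes (names and signatures here are illustrative, not the diffusers API):

```python
def forward(hidden_states, timestep, encoder_hidden_states,
            *, input_embed, time_embed, text_embed, blocks, norm_out, proj_out):
    # 1. Embed inputs and conditioning; broadcast a scalar timestep to the batch
    #    (mirroring the `timestep.ndim == 0` branch in the snippet above).
    batch_size = len(hidden_states)
    if not isinstance(timestep, list):
        timestep = [timestep] * batch_size
    hidden_states = [input_embed(h) for h in hidden_states]
    temb = [time_embed(t) for t in timestep]
    encoder_hidden_states = [text_embed(e) for e in encoder_hidden_states]

    # 2. Run the transformer blocks.
    for block in blocks:
        hidden_states = block(hidden_states, encoder_hidden_states, temb)

    # 3. Final norm and output projection.
    return proj_out(norm_out(hidden_states))
```

Grouping the method this way makes each stage independently readable without changing what is computed.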
| @slow |
| @require_torch_accelerator |
| def test_longcat_audio_pipeline_from_pretrained_real_local_weights(): |

Can you refactor test_longcat_audio_pipeline_from_pretrained_real_local_weights to be part of a LongCatAudioDiTPipelineSlowTests class? For reference, see the Stable Diffusion 3 slow tests.

Refactored the real local-weights test into a LongCatAudioDiTPipelineSlowTests class.
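A skeleton of the requested class layout, with placeholder decorators standing in for diffusers' @slow and @require_torch_accelerator (the real suite imports those from diffusers.utils.testing_utils; the test body here is a stub, not the actual local-weights test):

```python
import unittest

# Placeholder decorators; the real suite would use diffusers'
# @slow and @require_torch_accelerator instead.
def slow(cls):
    return cls

def require_torch_accelerator(cls):
    return cls

@slow
@require_torch_accelerator
class LongCatAudioDiTPipelineSlowTests(unittest.TestCase):
    def setUp(self):
        super().setUp()
        # Shared fixtures for all slow tests in this class.
        self.prompt = "a short spoken sentence"

    def test_pipeline_from_pretrained_real_local_weights(self):
        # The real test would load the pipeline from local weights and run
        # inference; this skeleton only shows where that logic lives.
        self.assertIsInstance(self.prompt, str)
```

Grouping slow tests in one TestCase keeps shared setup in setUp and lets the whole class be gated by the accelerator/slow markers at once.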
dg845 left a comment:

Thanks for iterating! I left some follow-up comments.

Thanks, PTAL.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@bot /style

Style bot fixed some files and pushed the changes.

These CI failures do not appear to be related to this PR.
What does this PR do?
Adds LongCat-AudioDiT model support to diffusers.
Although LongCat-AudioDiT can be used for TTS-like generation, it is fundamentally a diffusion-based audio generation model (text conditioning + iterative latent denoising + VAE decoding) rather than a conventional autoregressive TTS model, so I think it fits naturally into diffusers.
Test
Result
longcat.wav
Before submitting
See the documentation guidelines; here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.